Processing Sequences

Thus far in our course we have done both classification and regression analyses. For our classification, the data has been one of two types:

  1. Static feature data: for example, the pulsar dataset. We have 8 individual features characterizing pulsars as well as background, and our challenge was to develop a model capable of classifying new instances based on those features. We used fully connected networks (FCNs) to perform this task. Other tools would work well here, such as SVMs, decision trees, or random forests.
  2. Image data: here we have predominantly used the MNIST dataset, where our challenge was again to classify each instance as one of multiple possible classes (0-9). We used both FCNs and convolutional neural networks (CNNs).

The primary difference between the FCN and CNN approaches is that an FCN treats the input as a flat vector of independent features, while a CNN uses small shared kernels to discover local features that are approximately translation invariant.

In this workbook we will deal with a different sort of dataset: ordered sequences. We will focus initially on classification of time sequences, but will extend this to classification of text sequences.

This workbook is based on examples and code from these sources:

  1. Human Activity Recognition (HAR) Tutorial with Keras and Core ML (Part 1) by Nils Ackermann
  2. Introduction to 1D Convolutional Neural Networks in Keras for Time Sequences by Nils Ackermann
  3. How to Develop 1D Convolutional Neural Network Models for Human Activity Recognition by Jason Brownlee

HAR: Human Activity Recognition

The analysis we will do in this workbook deals with data collected from smartphones carried by human subjects engaged in normal daily activities: walking, sitting, jogging, standing, climbing upstairs, or descending downstairs. The smartphones were carried in the front pants pocket by 36 subjects. The raw data collected are the accelerometer readings in the x, y, and z directions (relative to the smartphones) collected at 50 Hz (i.e., 50 samples/second).

Given the phone orientation when a person is standing:

A paper describing the results can be found here.

A video of the activity can be found here.

Get the data:

The data can be found here: http://www.cis.fordham.edu/wisdm/dataset.php

I have placed this in the project area on OSC at this location: /fs/ess/PAS2038/PHYSICS5680_OSU/data/WISDM/WISDM_ar_v1.1/WISDM_ar_v1.1_raw.txt

There is a text file in the same location that has more descriptive information about the data.

The columns in the above files are the following:

Let's read the data in so we can explore it.

Simple plots

Let's make some simple plots showing how many samples are in the data for:

Splitting the data

We need to split the data into test and train data. The way we will do this is to see how much time each user spends in each activity. We will use defaultdicts to keep track of this.

Also, since each "step" in time is 1/50.0 of a second, we divide by 50 to convert steps to seconds (so activityStepsByUser is actual activity by user in seconds).
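The tally described above can be sketched as follows. This is a minimal sketch: the column names (`user`, `activity`) and the dictionary name are assumptions, and a tiny hypothetical frame stands in for the 1.1M-line file.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical mini-frame standing in for the raw WISDM data;
# the real file also has timestamp, x, y, z columns.
df = pd.DataFrame({
    "user": [33, 33, 33, 17, 17],
    "activity": ["Jogging", "Jogging", "Walking", "Sitting", "Sitting"],
})

# Nested defaultdict: activitySecondsByUser[user][activity] -> seconds
activitySecondsByUser = defaultdict(lambda: defaultdict(float))

for user, activity in zip(df["user"], df["activity"]):
    # each row is one 50 Hz sample, i.e. 1/50 s of that activity
    activitySecondsByUser[user][activity] += 1 / 50.0

print(dict(activitySecondsByUser[33]))
```

Iterating over `zip`-ped columns (rather than `df.iterrows()`) keeps the loop tolerable on a million-row frame.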

Be patient! This next code block takes a minute or so. There are 1.1M lines in the data file!

Printout

Now we can print out how much time each user spends in each activity. We see that some of the participants don't spend any time in some of the activities. We also note that if we order the participants by user id, approximately every 4 users corresponds to about 10% of the data.

Also note that the amount of time each user spends in each activity is typically about 50-60 s.

Data Split

We might want to keep all of a given user's data in either the test or train sample - this way we use data from different people to predict behavior of new people.

If we want an 80%/20% split, it looks like we can define:

We also need to convert the 'activity' column from text ('Jogging', etc) to a number (1-6) so we can one-hot encode it later.
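A sketch of the conversion step, assuming a `pandas` `map` from activity name to integer (the particular name-to-number assignment shown here is an assumption, not the workbook's official one):

```python
import pandas as pd

# Hypothetical slice of the activity column
df = pd.DataFrame({"activity": ["Jogging", "Walking", "Sitting"]})

# Map each activity name to an integer label 1-6 (assumed ordering)
LABELS = {"Walking": 1, "Jogging": 2, "Upstairs": 3,
          "Downstairs": 4, "Sitting": 5, "Standing": 6}
df["activity_id"] = df["activity"].map(LABELS)

# One-hot encoding can then be done later with Keras, e.g.:
# from tensorflow.keras.utils import to_categorical
# y = to_categorical(df["activity_id"] - 1, num_classes=6)
print(df["activity_id"].tolist())
```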

Normalizing feature data

As usual, we want to normalize our features. We will use training set maximums to do this.
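A minimal sketch of max-based normalization. The key point is that the scale factors come from the training set only, so no information leaks from the test set; the toy arrays here are illustrative.

```python
import numpy as np

# Hypothetical accelerometer arrays: shape (n_samples, 3) for x, y, z
X_train = np.array([[1.0, -2.0, 4.0],
                    [3.0,  2.0, -8.0]])
X_test = np.array([[2.0, 1.0, 2.0]])

# Per-channel maximum absolute value from the TRAINING set only
train_max = np.abs(X_train).max(axis=0)

X_train_norm = X_train / train_max
X_test_norm = X_test / train_max  # same scale applied to the test data

print(X_test_norm)  # test values may exceed 1, since maxima come from training
```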

Plotting

Let's make some plots to see what the data looks like. We will select various activities from our primary dataframe df and see if they make sense:

Assignment Task 1

Make plots for:

Creating our labeled data samples

From our above plots and tables, we see that the typical time each user spends in a given activity is tens to hundreds of seconds. The smallest non-zero time is about 11 seconds; the largest is just under 350 seconds.

We will define our samples to be 1.6 seconds long (80 time steps at 50 Hz), and assign as the label of each sample the activity that occurs most often within it. To increase the total number of samples, we can also allow some overlap between samples. We will not allow any overlap in our test data, however.
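The windowing scheme can be sketched with a simple sliding-window function. The function name and step sizes are assumptions: a step smaller than the window gives the overlapping (training) samples, and a step equal to the window gives the non-overlapping (test) samples.

```python
import numpy as np
from collections import Counter

def make_windows(signal, labels, window=80, step=40):
    """Slice a (n_steps, 3) signal into windows of fixed length.

    Each window's label is the most common activity inside it.
    step < window  -> overlapping windows (training)
    step == window -> non-overlapping windows (test)
    """
    X, y = [], []
    for start in range(0, len(signal) - window + 1, step):
        X.append(signal[start:start + window])
        y.append(Counter(labels[start:start + window]).most_common(1)[0][0])
    return np.array(X), np.array(y)

# Hypothetical data: 200 time steps of 3-channel accelerometer readings
signal = np.zeros((200, 3))
labels = ["Walking"] * 120 + ["Jogging"] * 80

X, y = make_windows(signal, labels, window=80, step=40)
print(X.shape)  # (4, 80, 3): four half-overlapping 80-step windows
```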

Classifying Time Sequences Using a Fully Connected Network

Let's begin with something reasonably simple. We have labeled samples, each 80 time steps long, with 3 channels of information at each time step. Let's treat this as $80\times3=240$ total features. We can easily write down a multi-layer network with a softmax output to classify this.

Our network will look like the figure below:

har_fcn_fig

"Deep Neural Network Example" by Nils Ackermann is licensed under Creative Commons CC BY-ND 4.0
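A minimal sketch of such an FCN in Keras. The input shape (80, 3) and six output classes come from the discussion above; the hidden-layer sizes here are illustrative assumptions, not the workbook's exact architecture.

```python
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Sequential

num_classes = 6

model_m = Sequential([
    # flatten each (80, 3) window into 80*3 = 240 flat features
    Flatten(input_shape=(80, 3)),
    Dense(100, activation="relu"),
    Dense(100, activation="relu"),
    Dense(num_classes, activation="softmax"),
])
model_m.compile(loss="categorical_crossentropy",
                optimizer="adam", metrics=["accuracy"])
model_m.summary()
```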

Fit the FCN

Here we use the callbacks option to monitor the validation loss, waiting patience=4 epochs before deciding to stop training. We also save the best model.
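The callbacks described above can be sketched like this; the checkpoint filename is an assumption, and the commented-out `fit` call shows where the list plugs in.

```python
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks_list = [
    # stop once val_loss has not improved for 4 consecutive epochs
    EarlyStopping(monitor="val_loss", patience=4),
    # keep the weights from the epoch with the lowest val_loss
    ModelCheckpoint(filepath="best_model.h5",
                    monitor="val_loss", save_best_only=True),
]

# model_m.fit(X_train, y_train, validation_split=0.2,
#             epochs=50, batch_size=400, callbacks=callbacks_list)
```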

Why did we call model_m.fit two times in a row?

Can you figure out why we called model_m.fit twice in the code above? Look at the difference in the two calls, as well as the printout, to see if you can come up with a reason why you might want to do that!

Convolutional Neural Networks for Sequences

The performance we obtained for a single train/test split was about 71% (evaluated where the validation loss was at a minimum). Can we do better?

We recall that in our study of images, we found that 2D convolution could help - this was because there were features present in the data that were somewhat translationally and rotationally invariant. The convolution process also allows us to discover features, rather than imposing the features from the outside. We can also use the convolution process with sequences. In this case however, we will employ 1D convolution.

First let's define the network using Keras; then we will step through the layers to see if we can understand each step.
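A sketch of such a 1D CNN, consistent with the layer walk-through that follows (the exact filter counts and dropout rate here are assumptions; the shape comments show the output of each layer):

```python
from tensorflow.keras.layers import (Conv1D, Dense, Dropout,
                                     GlobalAveragePooling1D, MaxPooling1D)
from tensorflow.keras.models import Sequential

model_m = Sequential([
    Conv1D(100, 10, activation="relu", input_shape=(80, 3)),  # -> (71, 100)
    Conv1D(100, 10, activation="relu"),                       # -> (62, 100)
    MaxPooling1D(3),                                          # -> (20, 100)
    Conv1D(160, 10, activation="relu"),                       # -> (11, 160)
    GlobalAveragePooling1D(),                                 # -> (160,)
    Dropout(0.5),
    Dense(6, activation="softmax"),
])
model_m.compile(loss="categorical_crossentropy",
                optimizer="adam", metrics=["accuracy"])
```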

Conv1D Details!

To understand how this works, refer to the figure below. In most respects this sort of network is extremely similar to the 2D convolutional networks we used for images, though there are some subtle differences.

  1. First, it is important to remember that our data is 80 steps in time, and there are 3 channels at each step. The convolution moves across the time dimension (which is why it is 1d).
  2. In both of the figures on the left underneath the first "Conv Layer" heading, time moves from top to bottom.
  3. The first convolutional layer has a kernel size of 10, and there are 100 such kernels (or filters). Note that since our input data has 3 channels, our kernels also have 3 channels (so the kernels are really of size 10x3). With a kernel size of 10, we convolve across 10 time steps and 3 channels at each time step, outputting one number, which can be thought of as a weighted average over the 3 channels and 10 time steps.
  4. Since we have a default stride of 1, we move ahead 1 time step and do the same calculation. Since we have 80 time steps, we end up with $80-10+1=71$ such calculations for each kernel. Since we have 100 kernels (or filters), the output size of the first convolutional layer is $71\times100$.
  5. The next convolutional layers proceed in the same way, and the max pooling layer behaves just like the corresponding layer from our image classification networks.
  6. Another tricky aspect of this network is a new layer called the GlobalAveragePooling1D. This turns out to be very helpful to reduce overfitting. The motivation for it can be found in this paper: Network in Network.
  7. Finally, there is a dropout layer. This layer is also used to help prevent overfitting. The idea is that during training a random set of the neurons in the dropout layer have their outputs forced to zero. After training, the neurons are no longer set to zero. For background on this idea, see the original paper where this was proposed: Dropout: A Simple Way to Prevent Neural Networks from Overfitting. There is an interesting motivating comment in the paper: "A closely related, but slightly different motivation for dropout comes from thinking about successful conspiracies. Ten conspiracies each involving five people is probably a better way to create havoc than one big conspiracy that requires fifty people to all play their parts correctly. If conditions do not change and there is plenty of time for rehearsal, a big conspiracy can work well, but with non-stationary conditions, the smaller the conspiracy the greater its chance of still working."

har_fcn_fig.
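The arithmetic in points 3 and 4 above generalizes to a simple formula for a "valid" (no-padding) 1D convolution:

```python
def conv1d_output_length(n_steps, kernel_size, stride=1):
    """Output length of a 'valid' (no-padding) 1D convolution."""
    return (n_steps - kernel_size) // stride + 1

# First Conv1D layer: 80 time steps, kernel size 10, stride 1 -> 71
print(conv1d_output_length(80, 10))
```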

Run the fitter!

We run the fitter in exactly the same way as we did the FCN above.

Multi-headed Network

We see from above that the 1D convolutional neural network greatly outperforms the standard FCN. Awesome! But notice that a key parameter in the network is the kernel size. We can see from our plots above comparing jogging/walking/sitting/climbing stairs that the time structure of the various activities is different. This means that a larger kernel size (covering more time steps) might make more sense for some activities, while a smaller kernel size might be more appropriate for others. Can we combine these in a single network? Yes! Enter the multi-headed network!

The basic idea is simple:

  1. We have 2 (or more) networks that get the same input.
  2. These two networks are then merged and produce the same softmax output as our above network.

To implement this network, it is necessary to use the Keras Functional API, which we have already introduced. Before we do the multi-headed network, let's do a copy of the above CNN network using the functional API:
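A sketch of the same CNN rewritten with the functional API. The layer parameters are the same assumptions as the Sequential sketch; the point is the style change, where each layer is called on the tensor produced by the previous one.

```python
from tensorflow.keras.layers import (Conv1D, Dense, Dropout,
                                     GlobalAveragePooling1D, Input,
                                     MaxPooling1D)
from tensorflow.keras.models import Model

inputs = Input(shape=(80, 3))
x = Conv1D(100, 10, activation="relu")(inputs)
x = Conv1D(100, 10, activation="relu")(x)
x = MaxPooling1D(3)(x)
x = Conv1D(160, 10, activation="relu")(x)
x = GlobalAveragePooling1D()(x)
x = Dropout(0.5)(x)
outputs = Dense(6, activation="softmax")(x)

model_f = Model(inputs=inputs, outputs=outputs)
model_f.compile(loss="categorical_crossentropy",
                optimizer="adam", metrics=["accuracy"])
```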

Now the true multi-head network

To make this network we remove the 2 convolutional layers at the end of the model - just to cut down on the total number of trainable parameters (to help reduce overfitting). The structure of the two "heads" is exactly the same, except for the kernel size. It is not necessary that the two networks be so similar - this is just done for convenience.

We then merge the two heads using a concatenate layer, and send the output of that through a softmax to get our final output. There are a lot of choices that can be made here: kernel sizes, number of kernels, number of convolutional layers, number of heads, amount of dropout, etc.
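The two-head idea can be sketched as follows. Each head is identical except for its kernel size; the particular kernel sizes (10 and 20) and filter counts are illustrative assumptions.

```python
from tensorflow.keras.layers import (Conv1D, Dense, Dropout,
                                     GlobalAveragePooling1D, Input,
                                     concatenate)
from tensorflow.keras.models import Model

def make_head(inputs, kernel_size):
    """One head: two Conv1D layers sharing a kernel size, then pooling."""
    x = Conv1D(100, kernel_size, activation="relu")(inputs)
    x = Conv1D(100, kernel_size, activation="relu")(x)
    x = GlobalAveragePooling1D()(x)
    return Dropout(0.5)(x)

inputs = Input(shape=(80, 3))

# both heads see the same input, then their features are concatenated
merged = concatenate([make_head(inputs, 10), make_head(inputs, 20)])
outputs = Dense(6, activation="softmax")(merged)

model_mh = Model(inputs=inputs, outputs=outputs)
model_mh.compile(loss="categorical_crossentropy",
                 optimizer="adam", metrics=["accuracy"])
```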

Assignment Task #2

Design a 3-head network. Try 3 different kernel sizes for the 3rd head: 5, 15, 25 (and use the same kernel size for each of the 2 Conv1D layers, as we do for the first 2 heads above). Which is best?

It would be a good idea (but is not required) to do this in a loop, varying the kernel size of the 3rd head.

Extra Credit:

  1. Plot the validation loss on the same plot for the 3 different kernel choices for the 3-head network.
  2. Make a confusion matrix for each of the 3 different kernel choices for the 3-head network.